etcdserver: adjust election timeout on restart #9364

gyuho · 2018-02-26T22:10:37Z

Still advance ticks on bootstrapping to fresh cluster.
But on restart, only advance 1/10 of original election ticks.

Address #9333.

Manually tested that it adjusts election ticks.

codecov-io · 2018-02-27T00:29:44Z

Codecov Report

Merging #9364 into master will increase coverage by 0.39%.
The diff coverage is 98.48%.

@@            Coverage Diff             @@
##           master    #9364      +/-   ##
==========================================
+ Coverage   72.36%   72.75%   +0.39%     
==========================================
  Files         362      362              
  Lines       30795    30846      +51     
==========================================
+ Hits        22285    22443     +158     
+ Misses       6869     6787      -82     
+ Partials     1641     1616      -25

Impacted Files	Coverage Δ
rafthttp/remote.go	`71.42% <100%> (ø)`	⬆️
rafthttp/peer_status.go	`88.88% <100%> (+2.68%)`	⬆️
etcdserver/membership/cluster.go	`87.2% <100%> (+0.21%)`	⬆️
etcdserver/raft.go	`89.36% <100%> (ø)`	⬆️
etcdserver/server.go	`80.41% <100%> (+1.16%)`	⬆️
rafthttp/peer.go	`90.22% <100%> (+1.5%)`	⬆️
rafthttp/transport.go	`83.06% <75%> (-0.19%)`	⬇️
raft/node.go	`90.62% <0%> (-1.79%)`	⬇️
etcdctl/ctlv3/command/printer_simple.go	`72.72% <0%> (-1.4%)`	⬇️
lease/lessor.go	`85.81% <0%> (-0.81%)`	⬇️
... and 23 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5132017...3257823. Read the comment docs.

wojtek-t · 2018-02-27T11:33:43Z

/subscribe

cc @mborsz

xiang90 · 2018-03-01T23:30:04Z

the approach seems fine. can someone reproduce the observed problem with/without this patch to make sure the problem is fixed by the patch?

xiang90 · 2018-03-01T23:31:39Z

etcdserver/raft.go

@@ -417,7 +407,6 @@ func startNode(cfg ServerConfig, cl *membership.RaftCluster, ids []types.ID) (id
 	raftStatusMu.Lock()
 	raftStatus = n.Status
 	raftStatusMu.Unlock()
-	advanceTicksForElection(n, c.ElectionTick)


we still should advanceTicks for newly start node. is there a reason not to do so?

xiang90 · 2018-03-01T23:34:56Z

etcdserver/server.go

@@ -521,9 +523,54 @@ func NewServer(cfg ServerConfig) (srv *EtcdServer, err error) {
 	}
 	srv.r.transport = tr

+	activePeers := 0
+	for _, m := range cl.Members() {


establishing connection can take time. probably need some delay here.

xiang90 · 2018-03-01T23:36:09Z

etcdserver/server.go

+		plog.Infof("%s is advancing %d ticks for faster election (election tick %d)", srv.ID(), tick, cfg.ElectionTicks)
+		advanceTicksForElection(n, tick)
+	} else {
+		// on restart, there is likely an active peer already


even for restart case, we should still consider the number of active member. if there is none, we still can advance ticks, no?

Yea, we would need wait until the local node finds its peers (cl.Members() > 0), to do that. I will play around it to address https://github.com/coreos/etcd/pull/9364/files#r171727961.

wojtek-t · 2018-03-02T11:48:30Z

@xiang90 @gyuho - we have patched this PR internall and will run tests with that;
we will get back to you probably some time early next week

gyuho · 2018-03-03T00:09:13Z

@xiang90 @wojtek-t

Last four commits add more detailed logging and better estimation on active peers.

Out of 3-node cluster,

Case 1. All 3 nodes start fresh (bootstrapping). In this case, fast-forward election ticks with last tick left.

embed: 729934363faa4a24 is advancing 9 ticks for faster election (election tick 10, restarted 'false', 3 found peer(s))

embed: 729934363faa4a24 started serving peers with 1 found peer(s) (no fast-forwarding election ticks after serving peers)

Case 2. Only 2 nodes are up, 1 node is down. The 1-node restarted. In this case, do not advance election ticks.

embed: 729934363faa4a24 started serving peers with 1 found peer(s) (no fast-forwarding election ticks after serving peers)

Case 3. All 3 nodes are down. Third node restarts with no active peer.
In this case, it advances only 1 tick out of 10 ticks (because there is no active peer).

embed: 7339c4e5e833c029 is advancing 1 ticks after serving peers (election tick 10, restarted 'true', 0 found peer(s))

xiang90 · 2018-03-03T00:26:47Z

Why do we need to differentiate restart vs fresh start?

The strategy should be simple as this.

if single node cluster -> advance ticks (only self can be leader anyway)
if not single node cluster, waits for the peer connection reports.
- If all the connection are failed, do nothing (needs to wait for other peer to be online anyway; advancing ticks here have no effect at all)
- If there is any active peers, advance ticks (if leader exists, this peer should receive a heartbeat soon; or we should start voting)

The only change we should do in this PR is to wait for the peer to be connected or the first connection is failed before advance ticks.

xiang90 · 2018-03-03T00:32:35Z

The most serious problem before is just that we failed to wait for the connection status before advancing ticks.

gyuho · 2018-03-03T02:07:14Z

@xiang90 I differentiated fresh cluster to get away with waiting, since its member list is already populated on start. But you are right, we can simplify this (since we also have discovery services on fresh cluster).

waits for the peer connection reports

Will make server wait up to 5 second, which is rafthttp.ConnReadTimeout, to simplify this logic.

If there is any active peers, advance ticks (if leader exists, this peer should receive a heartbeat soon; or we should start voting)

You mean advancing with adjusted ticks, right? Rejoining node to existing cluster can still have >1 active peers after 5-second wait time. If we have only one tick left, it can be still disruptive when the last tick elapse before leader heartbeat. Fast-forwarding with election tick / 10 (or any other adjustment) is the number that I am experimenting with.

Will clean this up.

xiang90 · 2018-03-03T02:14:08Z

Will make server wait up to 5 second, which is rafthttp.ConnReadTimeout, to simplify this logic.

Blindly waiting for 5 seconds is bad. Peer might be connected well before 5 seconds.

gyuho · 2018-03-03T02:18:45Z

@xiang90

Blindly waiting for 5 seconds is bad. Peer might be connected well before 5 seconds.

Yeah, I was thinking of adding notify routine from rafthttp, so that we discover the connectivity earlier.

xiang90 · 2018-03-03T02:19:59Z

If we have only one tick left, it can be still disruptive when the last tick elapse before leader heartbeat.

then forward it until there are two ticks. Leader should send a heartbeat within one tick. Giving it one more tick should be enough.

gyuho · 2018-03-03T02:21:09Z

Giving it one more tick should be enough.

Sounds good. Will work on it.

Thanks!

gyuho · 2018-03-05T21:20:58Z

Now

fresh node (to 3-node cluster)

etcdserver: b548c2511513015 fast-forwarding 9 ticks (election ticks 10) with 3 found member(s)

restart node (to 3-node cluster)

rafthttp: peer 729934363faa4a24 became active
etcdserver: b548c2511513015 fast-forwarding 8 ticks (election ticks 10) with 3 found member(s)

restarted single-node cluster

etcdserver: 8e9e05c52164694d waited 5s but no active peer found (or restarted 1-node cluster)

xiang90 · 2018-03-05T21:43:08Z

restarted single-node cluster

This we do not fast forward ticks for this case?

The strategy should be simple as this.

if single node cluster -> advance ticks (only self can be leader anyway)

if not single node cluster, waits for the peer connection reports.
If all the connection are failed, do nothing (needs to wait for other peer to be online anyway; advancing ticks here have no effect at all)
If there is any active peers, advance ticks (if leader exists, this peer should receive a heartbeat soon; or we should start voting)

if we follow this policy, it should be fast forwarded, no?

gyuho · 2018-03-05T21:46:11Z

if we follow this policy, it should be fast forwarded, no?

restarted single-node cluster

This we do not fast forward ticks for this case?

I left it as TODO for now.

Let me see if we can also handle the single-node case as well.

mborsz · 2018-03-08T13:59:02Z

With @wojtek-t we have patched this pull request (the version from March 2nd: commits fe9b909 and 5405135) internally and tested this week.
This patch seems to work -- I haven't seen any member rejoining etcd cluster that triggered leadership election.

Signed-off-by: Gyuho Lee <gyuhox@gmail.com>

gyuho · 2018-03-08T19:40:37Z

We've made adjust logic more fine-grained so that it can handle the restarting 1-node cluster. It would be great if we can confirm with latest commits as well. Thanks.

xiang90 · 2018-03-08T21:51:09Z

etcdserver/server.go

@@ -527,6 +539,62 @@ func NewServer(cfg ServerConfig) (srv *EtcdServer, err error) {
 	}
 	srv.r.transport = tr

+	// fresh start
+	if !haveWAL {


why do we need to care about restart vs fresh start?

see #9364 (comment).

Just easier, so that fresh start does not need to synchronize with peer connection reports. But as you suggested, let me simplify the logic (#9364 (comment)).

xiang90 · 2018-03-08T21:53:15Z

etcdserver/server.go

+
+	srv.goAttach(func() {
+		select {
+		case <-cl.InitialAddNotify():


this is pretty complicated. let us just get the peer list from the existing snapshot. we do not need to ensure all the configuration in the wal file are executed.

the reason for that is reconfiguration in infrequent. and moving from one -> N nodes cluster is even more infrequent. snapshot will contain the correct information 99% of the time.

I was trying to cover all cases where there's no snapshot (which needs to populate member lists from WAL). But, agree that this should be simplified by loading members from snapshot. Will rework on this.

gyuho · 2018-03-08T22:12:50Z

Xiang has a good point. This is a bit too complicated. I will create a separate PR with simpler solution.

gyuho · 2018-03-10T18:45:55Z

This will be replaced by #9415.

gyuho added the WIP - DO NOT MERGE label Feb 26, 2018

gyuho force-pushed the sync-rafthttp branch 2 times, most recently from 5bfa52a to fe9bff5 Compare February 26, 2018 22:34

gyuho removed the WIP - DO NOT MERGE label Feb 26, 2018

gyuho mentioned this pull request Feb 27, 2018

rafthttp: make "ActiveSince" non-blocking on write lock #9366

Merged

gyuho force-pushed the sync-rafthttp branch 3 times, most recently from d63b940 to 5405135 Compare March 1, 2018 23:21

xiang90 reviewed Mar 1, 2018

View reviewed changes

gyuho force-pushed the sync-rafthttp branch from 5405135 to b26594c Compare March 3, 2018 00:07

gyuho added the WIP - DO NOT MERGE label Mar 3, 2018

gyuho force-pushed the sync-rafthttp branch from b26594c to 73ce98b Compare March 5, 2018 21:20

gyuho removed the WIP - DO NOT MERGE label Mar 5, 2018

gyuho force-pushed the sync-rafthttp branch 2 times, most recently from 1f03eea to 9d4440d Compare March 6, 2018 16:10

gyuho removed the WIP - DO NOT MERGE label Mar 6, 2018

gyuho mentioned this pull request Mar 7, 2018

*: configure Raft Pre-Vote to reduce disruptive rejoining servers #9352

Merged

gyuho force-pushed the sync-rafthttp branch 4 times, most recently from d02b04e to eab6108 Compare March 7, 2018 22:34

gyuho mentioned this pull request Mar 8, 2018

Release planning for March #9411

Closed

25 tasks

gyuho added 6 commits March 8, 2018 11:06

rafthttp: add "InitialPeerNotify"

4f67bea

Signed-off-by: Gyuho Lee <gyuhox@gmail.com>

etcdserver: add "InitialPeerNotify" to "*nopTransporterWithActiveTime"

58f5219

Signed-off-by: Gyuho Lee <gyuhox@gmail.com>

etcdserver: protect raft.Node.Tick with mutex

a97b9e2

Signed-off-by: Gyuho Lee <gyuhox@gmail.com>

etcdserver: move "advanceRaftTicks" to "NewServer"

6371853

Signed-off-by: Gyuho Lee <gyuhox@gmail.com>

etcdserver/membership: add "InitialAddNotify"

1d215d8

Signed-off-by: Gyuho Lee <gyuhox@gmail.com>

etcdserver: adjust election ticks on restart

3257823

Signed-off-by: Gyuho Lee <gyuhox@gmail.com>

gyuho force-pushed the sync-rafthttp branch from eab6108 to 3257823 Compare March 8, 2018 19:09

jpbetz added the Release-Backport/v3.2 label Mar 8, 2018

gyuho added the backport/v3.3 label Mar 8, 2018

xiang90 reviewed Mar 8, 2018

View reviewed changes

gyuho added the WIP - DO NOT MERGE label Mar 8, 2018

gyuho mentioned this pull request Mar 9, 2018

etcdserver: adjust election timeout on restart #9415

Merged

gyuho removed Release-Backport/v3.2 backport/v3.3 labels Mar 10, 2018

gyuho closed this Mar 10, 2018

gyuho deleted the sync-rafthttp branch March 14, 2018 07:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

etcdserver: adjust election timeout on restart #9364

etcdserver: adjust election timeout on restart #9364

gyuho commented Feb 26, 2018 •

edited

Loading

codecov-io commented Feb 27, 2018 •

edited

Loading

wojtek-t commented Feb 27, 2018

xiang90 commented Mar 1, 2018

xiang90 Mar 1, 2018

xiang90 Mar 1, 2018

xiang90 Mar 1, 2018

gyuho Mar 2, 2018

wojtek-t commented Mar 2, 2018

gyuho commented Mar 3, 2018

xiang90 commented Mar 3, 2018 •

edited

Loading

xiang90 commented Mar 3, 2018

gyuho commented Mar 3, 2018

xiang90 commented Mar 3, 2018

gyuho commented Mar 3, 2018

xiang90 commented Mar 3, 2018

gyuho commented Mar 3, 2018

gyuho commented Mar 5, 2018

xiang90 commented Mar 5, 2018

gyuho commented Mar 5, 2018

mborsz commented Mar 8, 2018

gyuho commented Mar 8, 2018

xiang90 Mar 8, 2018

gyuho Mar 8, 2018

xiang90 Mar 8, 2018

xiang90 Mar 8, 2018

gyuho Mar 8, 2018

gyuho commented Mar 8, 2018

gyuho commented Mar 10, 2018

etcdserver: adjust election timeout on restart #9364

etcdserver: adjust election timeout on restart #9364

Conversation

gyuho commented Feb 26, 2018 • edited Loading

codecov-io commented Feb 27, 2018 • edited Loading

Codecov Report

wojtek-t commented Feb 27, 2018

xiang90 commented Mar 1, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wojtek-t commented Mar 2, 2018

gyuho commented Mar 3, 2018

xiang90 commented Mar 3, 2018 • edited Loading

xiang90 commented Mar 3, 2018

gyuho commented Mar 3, 2018

xiang90 commented Mar 3, 2018

gyuho commented Mar 3, 2018

xiang90 commented Mar 3, 2018

gyuho commented Mar 3, 2018

gyuho commented Mar 5, 2018

xiang90 commented Mar 5, 2018

gyuho commented Mar 5, 2018

mborsz commented Mar 8, 2018

gyuho commented Mar 8, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gyuho commented Mar 8, 2018

gyuho commented Mar 10, 2018

gyuho commented Feb 26, 2018 •

edited

Loading

codecov-io commented Feb 27, 2018 •

edited

Loading

xiang90 commented Mar 3, 2018 •

edited

Loading